{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TiDB Vector Store\n",
    "\n",
    "> [TiDB Cloud](https://www.pingcap.com/tidb-serverless/), is a comprehensive Database-as-a-Service (DBaaS) solution, that provides dedicated and serverless options. TiDB Serverless is now integrating a built-in vector search into the MySQL landscape. With this enhancement, you can seamlessly develop AI applications using TiDB Serverless without the need for a new database or additional technical stacks. Create a free TiDB Serverless cluster and start using the vector search feature at https://pingcap.com/ai.\n",
    "\n",
    "This notebook provides a detailed guide on utilizing the tidb vector search in LlamaIndex."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up environments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-vector-stores-tidbvector\n",
    "%pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import textwrap\n",
    "\n",
    "from llama_index.core import SimpleDirectoryReader, StorageContext\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.vector_stores.tidbvector import TiDBVectorStore"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Configuring your OpenAI Key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import getpass\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"Input your OpenAI API key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Configure TiDB connection setting that you will need. To connect to your TiDB Cloud Cluster, follow these steps:\n",
    "\n",
    "- Go to your TiDB Cloud cluster Console and navigate to the `Connect` page.\n",
    "- Select the option to connect using `SQLAlchemy` with `PyMySQL`, and copy the provided connection URL (without password).\n",
    "- Paste the connection URL into your code, replacing the `tidb_connection_string_template` variable.\n",
    "- Type your password."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace with your tidb connect string from tidb cloud console\n",
    "tidb_connection_string_template = \"mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true\"\n",
    "# type your tidb password\n",
    "tidb_password = getpass.getpass(\"Input your TiDB password:\")\n",
    "tidb_connection_url = tidb_connection_string_template.replace(\n",
    "    \"<PASSWORD>\", tidb_password\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Prepare data that used to show case"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document ID: 86e12675-2e9a-4097-847c-8b981dd41806\n"
     ]
    }
   ],
   "source": [
    "documents = SimpleDirectoryReader(\"./data/paul_graham\").load_data()\n",
    "print(\"Document ID:\", documents[0].doc_id)\n",
    "for index, document in enumerate(documents):\n",
    "    document.metadata = {\"book\": \"paul_graham\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create TiDB Vectore Store\n",
    "\n",
    "The code snippet below creates a table named `VECTOR_TABLE_NAME` in TiDB, optimized for vector searching. Upon successful execution of this code, you will be able to view and access the `VECTOR_TABLE_NAME` table directly within your TiDB database environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "VECTOR_TABLE_NAME = \"paul_graham_test\"\n",
    "tidbvec = TiDBVectorStore(\n",
    "    connection_string=tidb_connection_url,\n",
    "    table_name=VECTOR_TABLE_NAME,\n",
    "    distance_strategy=\"cosine\",\n",
    "    vector_dimension=1536,\n",
    "    drop_existing_table=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a query engine based on tidb vectore store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "storage_context = StorageContext.from_defaults(vector_store=tidbvec)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context, show_progress=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: If you encounter errors during this process due to the MySQL protocol’s packet size limitation, such as when trying to insert a large number of vectors (e.g., 2000 rows) , you can mitigate this issue by splitting the insertion into smaller batches. For example, you can set the `insert_batch_size` parameter to a smaller value (e.g., 1000) to avoid exceeding the packet size limit, ensuring smooth insertion of your data into the TiDB vector store:\n",
    "\n",
    "```python\n",
    "storage_context = StorageContext.from_defaults(vector_store=tidbvec)\n",
    "index = VectorStoreIndex.from_documents(\n",
    "    documents, storage_context=storage_context, insert_batch_size=1000, show_progress=True\n",
    ")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Semantic similarity search\n",
    "\n",
    "This section focus on vector search basics and refining results using metadata filters. Please note that tidb vector only supports Deafult VectorStoreQueryMode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The author wrote a book.\n"
     ]
    }
   ],
   "source": [
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What did the author do?\")\n",
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filter with metadata\n",
    "\n",
    "perform searches using metadata filters to retrieve a specific number of nearest-neighbor results that align with the applied filters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Empty Response\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.vector_stores.types import (\n",
    "    MetadataFilter,\n",
    "    MetadataFilters,\n",
    ")\n",
    "\n",
    "query_engine = index.as_query_engine(\n",
    "    filters=MetadataFilters(\n",
    "        filters=[\n",
    "            MetadataFilter(key=\"book\", value=\"paul_graham\", operator=\"!=\"),\n",
    "        ]\n",
    "    ),\n",
    "    similarity_top_k=2,\n",
    ")\n",
    "response = query_engine.query(\"What did the author learn?\")\n",
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Query again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The author learned valuable lessons from his experiences.\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.vector_stores.types import (\n",
    "    MetadataFilter,\n",
    "    MetadataFilters,\n",
    ")\n",
    "\n",
    "query_engine = index.as_query_engine(\n",
    "    filters=MetadataFilters(\n",
    "        filters=[\n",
    "            MetadataFilter(key=\"book\", value=\"paul_graham\", operator=\"==\"),\n",
    "        ]\n",
    "    ),\n",
    "    similarity_top_k=2,\n",
    ")\n",
    "response = query_engine.query(\"What did the author learn?\")\n",
    "print(textwrap.fill(str(response), 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Delete documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tidbvec.delete(documents[0].doc_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check whether the documents had been deleted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Empty Response\n"
     ]
    }
   ],
   "source": [
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What did the author learn?\")\n",
    "print(textwrap.fill(str(response), 100))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
